Why are nonlinear fits so challenging? 
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Fitting model parameters to experimental data is a common yet often challenging task, especially 
if the model contains many parameters. Typically, algorithms get lost in regions of parameter space 
in which the model is unresponsive to changes in parameters, and one is left to make adjustments 
by hand. We explain this difficulty by interpreting the fitting process as a generalized interpolation 
procedure. By considering the manifold of all model predictions in data space, we find that cross 
sections have a hierarchy of widths and are typically very narrow. Algorithms become stuck as 
they move near the boundaries. We observe that the model manifold, in addition to being tightly 
bounded, has low extrinsic curvature, leading to the use of geodesies in the fitting process. We 
improve the convergence of the Levenberg-Marquardt algorithm by adding geodesic acceleration to 
the usual step. 



The estimation of model parameters from experimen- 
tal data is astonishingly challenging. A nonlinear model 
with tens of parameters, fit (say) by least-squares to ex- 
perimental data, can take weeks of hand-fiddling before 
a qualitatively reasonable agreement can be found; even 
then, the parameters cannot usually reliably be extracted 
from the data. Both general minimization algorithms and 
algorithms like Levenberg-Marquardt that are designed 
for least-squares fits routinely get lost in parameter space. 
This becomes a serious obstacle to progress when one is 
unsure of the validity of the model, e.g. in systems biology 
where one wants to automatically generate and explore 
a variety of alternative models. 

Here we use differential geometry to explain why fits 
are so hard. We first explore the structure of the model 
manifold M , the manifold of predictions embedded in the 
space of data, D, and find that it is typically bounded, 
with cross sections having a hierarchy of widths, so that 
the overall structure is similar to that of a long, thin 
ribbon. We explain this hierarchy by viewing the fitting 
process as a generalized interpolation procedure with few 
effective model degrees of freedom. We interpret the dif- 
ficulty in fitting to be due to algorithms getting stuck 
near the boundary of M, where the model is unrespon- 
sive to variations in the parameters. We then discuss how 
geometry motivates algorithms to alleviate this difficulty. 

A typical nonlinear least squares problem fits a model 
Y m (0) with N parameters 9 to M experimental data 
points y m . We define the model manifold, M, as the 
parametrized TV-dimensional surface Y{9) embedded in 
Euclidean data space, D = R M . The best fit to the 
experiment is given by the point on M. with Euclidean 
distance closest to the data, minimizing the cost C — 
\{Y{9) — y) 2 . The Euclidean metric of data space (with 
distance between models given by the change in resid- 
uals, r = Y(9) — y) induces a metric on the manifold, 
SV = d,jY ■ d v Y = (jTj)^, where J mf i = ^-Ym\ g^v 
is known as the Fisher Information matrix. As an exam- 



Figure 1: The model manifold for the two-exponential prob- 
lem, with yi evaluated at t = 1/3, 1, and 3. Boundaries exist 
when # M = 0, oo and when 9\ = Bi. (The ribbon-like structure 
of Fig. [2]emerges only in higher dimensions.) 



pie, the model Y(t, 9) = f e (t) = + e"" 2 ' sampled at 
three time points is given in Fig. [TJ The model manifold 
has been extensively studied by the information geometry 
statistics community but they focus on the intrinsic 
curvature; as the cost is the distance in data space, the 
embedding and its extrinsic curvature are crucial to find- 
ing best fits 0,0]. 

As seen in figs. El and O this model manifold can take 
the form of a hyper-ribbon, with thinnest direction four 
orders of magnitude smaller than the long axes. To 
understand this observed hierarchy, consider the special 
case of analytic models, f(t,9), of a single independent 
variable (time) where the data points are Y m = f(t m ). 
Let R be the typical time scale over which the model be- 
havior changes, so that the n th term of the Taylor series 
f [n) (t)/n\ < R~ n (roughly the radius of convergence) . 
If the function is sampled at n time point (ti,£2> •••>*») 
within this time scale, the Taylor series may be approxi- 
mated by the unique polynomial of degree n—1, P n -i(t) 
passing through these points. At a new point, to, the 
discrepancy between the interpolation and the function 
is given by 
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Figure 2: Top: Two views of the cross section of the 
model manifold for an infinite sum of exponentials fA,e(i) = 
^2 n A n exp(—9„t) with A n > 0, given by fixing F(0) = 1 
and F(l) = 1/e. Bottom: The range of allowed fits (grey) 
is strongly reduced by fixing the output at t = 1/2 to the 
midpoint of its range (black). 



where £ lies somewhere in the interval containing 
to, ti, t n Q. The polynomial u> n (t) has roots at each 
of the interpolating points uj n (t) = (t—t 1 )(t~t 2 )---(t—t n ). 
The possible error of the interpolation function bounds 
the allowed range of behavior, A/„, of the model at to 
after constraining the nearby n data points (i.e. cross 
sections). Consider the ratio of successive cross sections, 

= (t - t n+l ){n + 1)7^7- If n is sufficiently 

large, then (n + l)^pr^?y ~ jj; therefore, we find that 

A /j +1 ~ t ~ t R +1 < 1 by the ratio test. Each cross sec- 
tion is thinner than the last by a roughly constant factor, 
yielding the observed hierarchy. 

We argue that this hyper-ribbon structure will be 
shared with a wide variety of nonlinear, multiparame- 
ter models. Note that the eigenvalues of the metric ten- 
sor <7 M „ in Fig. [3] also form a hierarchy, spanning eight 
orders of magnitude - this 'sloppiness' has been docu- 
mented in a number of other models, including seven- 
teen in systems biology 0L insect flight and variational 
quantum wave functions [fij] , interatomic potentials Q, 
and a model of the next-generation international linear 
collider ||. (Multiparameter models whose parameters 
are individually measured by the data, and on the other 
extreme models with sensitive, chaotic dependence on 
parameters, will likely not fall into this family.) Most 
parameters in these models have bounded effects - they 
can be set to limiting values (zero, infinity, etc.) and 
still have finite model predictions - rates and Michaelis- 
Menten coefficients in systems biology, Jastrow and de- 
terminential factors in variational wave functions, etc. 
Note that the widths of the model manifold track nicely 
with these eigenvalues in Fig. [3] (taking the square root 
to match units): moving parameter combinations along 
eigendirections of the metric by an order of magnitude 
(fixed shift in log parameters) exhausts the range of be- 
havior (width). This tracking suggests that the ubiquity 
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Figure 3: Geodesic cross-sectional widths of an eight dimen- 
sional model manifold along the eigendirections of the metric 
from some central point, together with the square root of the 
eigenvalues (singular values of the Jacobian), the inverse ex- 
trinsic (geodesic) curvature K [af and the inverse geodesic 
parameter-effects curvature K p [a. I3L Hcl|. Notice the hierar- 
chy of these data-space distances - the widths and singular 
values each spanning around four orders of magnitude and 
the curvatures covering eight. Note also that the extrinsic 
curvatures are three orders of magnitude smaller than the 
parameter-effects curvature. 



of sloppy eigenvalue spectra at best fits implies a ubiqui- 
tous hyper-ribbon structure for the model manifold. 

Our observation that many models are sloppy, pre- 
sumably sharing this hyper-ribbon model manifold struc- 
ture, is now explained: multiparameter models are a kind 
of high-dimensional analytic interpolation scheme, and 
near degenerate Hessians result whenever multiple data 
points reside within some generalized radius of conver- 
gence. When this is the case, the data points are highly 
correlated and the model has few effective degrees of free- 
dom. Whenever there are many model parameters for 
each effective degree of freedom there will be a hierarchy 
of widths and the model will be sloppy. 

Our geometric interpretation explains a number of ob- 
servations about nonlinear models. First, although pa- 
rameters cannot be reliably extracted by fitting degener- 
ate models, it is still possible to constrain the outcome 
of new experiments Because the fitting process is 

an interpolation scheme, only a few stiff parameter com- 
binations need to be tuned to fit most of the data, since 
only a few data points constrain the predictions at other 
times. The remaining unconstrained parameter combina- 
tions control the interpolated values, which are already 
restricted by the analyticity of the model. 

Figure 05 also shows that the parameter effects curva- 
ture |2, 15 TL0| and the geodesic extrinsic curvatures vary 
over twice as many decades as the widths and -\/A; in- 
deed, their formulas include a factor of 1/A|23. Why is 
the extrinsic curvature so much smaller? The manifold 
has zero extrinsic curvature if there are equal numbers 
of parameters as data points, N = M (where the model 
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Figure 4: Geodesies can be used to construct polar coordi- 
nates on M (above). In these new geodesic coordinates, the 
cost contours are nearly perfect, isotropic circles (below). 



manifold is a sub-volume in the Euclidean data space). 
We have seen that most of the data points are interpo- 
lations that supply little new information; the extrinsic 
curvature will be small when the effective dimensional- 
ity of the embedding space is not larger than than the 
number of parameters. [24] 

We use geodesies to construct new polar coordinates 
7 M = 7 M (0) on M. that generalize Riemann normal coor- 
dinates [12J]. Since geodesies are nearly straight lines in 
data space on M. , we find that cost contours in these co- 
ordinates are nearly quadratic and isotropic around the 
best fit, as explicitly computed in Fig.[H Nonlinear mod- 
els locally look like linear models with badly chosen pa- 
rameters, well beyond the harmonic approximation. 

Can these nearly straight geodesies inspire algorithms 
that lead efficiently to the best fit? Integrating the 
geodesic equation with a single Euler step reproduces the 
Gauss-Newton step (56^ — —g^'VyC in our notation), 
ineffective due to the large eigenvalues in the inverse met- 
ric g^; even geodesic motion hits the boundaries of M. 
The ribbon is nearly flat, but very thin; the geodesies hit 
the edges long before finding a good fit. 

To improve convergence, we can modify the model 
manifold to remove the boundaries. One method of do- 
ing this is to introduce the model graph Q, which is the 
TV-dimensional parametric surface drawn by the model 
embedded in data space crossed with parameter space. 
Since most boundaries occur at infinite parameter val- 
ues, the model graph Q 'stretches' these boundaries to 
infinity in the parameter space portion of the embed- 
ding. The metric for the model graph is an interpo- 
lation of the data space and parameter space metrics, 
9V = 9uv + A*/ M „, where A* determines the weight of 
the two spaces. Notice that the eigendirections of the 
metric are the same on both M. and Q\ however, the 



Algorithm 


Success Rate 


Mean njev 


Mean nfev 


Traditional LM + accel 


65% 


258 


1494 


Traditional LM 


33% 


2002 


4003 


Trust Region LM 


12% 


1517 


1649 


BFGS 


8% 


5363 


5365 



Table I: The results of several algorithms applied to a test 
problem of fitting a sum of four exponential terms (varying 
both rates and amplitudes) in log-parameters (to enforce pos- 
itivity). Initial conditions are chosen near a manifold bound- 
ary with a best fit of zero cost near the center of the manifold. 
Among successful attempts, we further compare the average 
number of Jacobian and function evaluations needed to ar- 
rive at the fit. Success rate indicates an algorithm's ability 
to avoid the manifold boundaries (find the canyon from the 
plateau), while the number of Jacobian and function evalua- 
tions indicate how efficiently it can follow the canyon to the 
best fit. BFGS is a quasi newton scalar minimizer of Broy- 
den, Fletcher^ Goldfarb, and Shanno (BFGS) [H, El- The 
traditional EH and trust region [lj] implementations of 
Levenberg-Marquardt consistently outperform this and other 
general optimization routines on least squares problems, such 
as Powell, simplex, and conjugate gradient. Including the 
geodesic acceleration on a standard variant of Levenberg- 
Marquardt dramatically increases the success rate while de- 
creasing the computation time. 



eigenvalues on the graph are given by Xg = \m + A*. 
Therefore, the degenerate eigendirections with Xm *C A* 
have eigenvalues Xg « A* on the model graph; A* cuts off 
the small eigenvalues of the Hessian. The analogy of the 
Gauss-Newton step on the model graph is the well-known 
Levenberg-Marquardt stev^SO = — ( J T J + X*!) 1 VC 
in our notation (l3l . H3 . Ha ]. By dynamically adjust- 
ing A*, the algorithm can shorten its step, removing the 
danger of the degenerate Hessian, while rotating from 
the Gauss-Newton direction into the steepest descents 
direction. Geometrically we understand the superiority 
of Levenberg-Marquardt to be due to the lack of bound- 
aries of the model graph. 

Inspired by the results in Fig. IU we further improve 
the standard Levenberg-Marquardt algorithm. Interpret- 
ing the Levenberg-Marquardt step as a velocity, t> M = 
—g^VvC, where g is the metric on the model graph, the 
geodesic acceleration (giving the parameter-effects curva- 
ture) is given by a M = —g^ v d v y- d a dpy v a v@, giving a 
step 89^ — w M + ia M . The geodesic acceleration is very 
cheap to calculate, requiring only a directional second 
derivative, which can be estimated from three (cheap) 
function evaluations (one or two additional function eval- 
uations) at each step with no extra (expensive) Jaco- 
bians. The geodesic acceleration serves two purposes. 
First, it provides an estimate for the trust region in which 
the linearization approximation (from which Levenberg- 
Marquardt is traditionally derived) is valid. At each step, 
we adjust Auntil the acceleration is smaller than the ve- 
locity, which we find is more effective at avoiding model 
boundaries than either tuning until a downhill step is 
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found [H| or considering the reduction ratio The 
second benefit of the acceleration occurs when the algo- 
rithm must follow a long narrow canyon to the best fit. 
In these scenarios convergence may be sped up by ap- 
proximating the path with a parabola instead of a line. 
The utility of the geodesic acceleration is seen in Table 
U, where the performance of several algorithms on a test 
problem is summarized. More extensive comparisons and 
further refined algorithms are in preparation [18]. 

Just as the special sum of squares form of the cost 
function gives an approximate Hessian using only first 
derivatives of the residuals, H w J T J, it has (together 
with the low extrinsic curvature) allowed us to approx- 
imate the cubic correction using only a directional sec- 
ond derivative. All other algorithms that seek to im- 
prove the Levenberg-Marquardt algorithm use second 
derivative information only to calculate the correction 
SHpv = r- df.duf, to the Hessian [H, El E0, HI ■ This 
correction is negligible if the nonlinearities are primarily 
parameter-effects curvature; since the unfit data is nearly 
perpendicular to the surface of the model manifold while 
the nonlinearities are tangent to the model manifold, the 
dot product vanishes. Qualitatively, this means that the 
approximate Hessian is very accurate and that the bend- 



ing of the local ellipses, (due to the third order terms in 
the cost that we consider) are the most important cor- 
rection. 

By interpreting the fitting process as a generalized in- 
terpolation scheme, we have seen that the difficulties in 
fitting are due to the narrow boundaries on the model 
manifold, M. These boundaries form a hierarchy of 
widths dual to the hierarchy of Hessian eigenvalues char- 
acteristic of nonlinear model fits. Additionally, we both 
observe and argue that the model manifold is remark- 
ably flat (low extrinsic curvature), which leads us to the 
use of geodesies in the fitting process. The modified 
Gauss-Newton and Levenberg-Marquardt algorithms are 
understood to be Euler approximations to the geodesic 
equation on the model manifold and model graph respec- 
tively. The geodesic acceleration improves convergence 
of Levenberg-Marquardt by providing a more accurate 
trust region while reducing computation time. Data fits 
are both practically important and theoretically elegant. 
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